  • If | & satisfies multiple conditions

    Hello, I'm having trouble generating a variable from several variables that are mutually inclusive.

    I need to generate the variable 'diagnosis'.

    For this variable I need 3 criteria:

    1. variable q60==1
    2. variable ibs_dx==2
    3. 2 of the following conditions must be true:
    -q49>=4
    -q51>=4
    -q52>=4
    -q53>=4
    -q54>=4
    -q55>=4
    *Note: any combination of 2 of the 6 items above should qualify. This is the part I'm having trouble generating.

    Can someone help with some ideas?


  • #2
    Hi Dalton,

    Below you find a rough solution.

    Code:
    * (x >= 4) is also true when x is missing, so each indicator adds the guard x < .
    gen indicator_q49 = (q49 >= 4 & q49 < .)
    gen indicator_q51 = (q51 >= 4 & q51 < .)
    gen indicator_q52 = (q52 >= 4 & q52 < .)
    gen indicator_q53 = (q53 >= 4 & q53 < .)
    gen indicator_q54 = (q54 >= 4 & q54 < .)
    gen indicator_q55 = (q55 >= 4 & q55 < .)
    
    egen sum = rowtotal(indicator_q49 indicator_q51 indicator_q52 indicator_q53 indicator_q54 indicator_q55)
    
    gen diagnosis = 1 if (q60==1) & (ibs_dx==2) & (sum==2)
    Note that by setting sum==2 in the last line you require that exactly 2 of the 6 statements in your third criterion are true. If 3 of those statements are true for an observation, this code assigns that observation a missing value. If you want "two or more" of the statements to qualify, change sum==2 to sum>=2 in the last line.
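A minimal sketch of the difference between "exactly two" and "at least two", using made-up toy data with only three of the items (the values and the names n_high, exactly_two, and at_least_two are assumptions for illustration):

```stata
* Toy data (hypothetical values): three items instead of six.
clear
input q49 q51 q52
5 5 1
5 5 5
1 1 5
end

* Count how many items are >= 4; the "< ." guard keeps missing values out.
gen n_high = (q49 >= 4 & q49 < .) + (q51 >= 4 & q51 < .) + (q52 >= 4 & q52 < .)

gen exactly_two  = (n_high == 2)
gen at_least_two = (n_high >= 2)

list, noobs clean
```

The second observation has three items at 4 or above, so it is flagged by at_least_two but not by exactly_two.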



    • #3
      You could put everything in one line

      Code:
      gen wanted = (q60 == 1) & (ibs_dx == 2) & ((q49>=4) + (q51>=4) + (q52>=4) + (q53>=4) + (q54>=4) + (q55>=4)) >= 2
      If you wanted to make that clearer in a script, you could space it out, say

      Code:
      gen wanted = (q60 == 1)  ///  
                 & (ibs_dx == 2) ///
                 & ((q49>=4) + (q51>=4) + (q52>=4) + (q53>=4) + (q54>=4) + (q55>=4)) >= 2
      or do something like this

      Code:
      local A (q60 == 1)
      local B (ibs_dx == 2)
      local C ((q49>=4) + (q51>=4) + (q52>=4) + (q53>=4) + (q54>=4) + (q55>=4)) >= 2
      gen wanted = `A' & `B' & `C'
      but note that e.g.

      Code:
      q49 >= 4
      includes missing on q49.
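The reason is that Stata treats numeric missing as larger than any number. A small sketch with made-up data shows the effect and one common guard (the variable names naive and guarded are assumptions for illustration):

```stata
* Toy data (hypothetical values) with some missing items.
clear
input q49 q51
5 .
. .
2 4
end

* Unguarded: a missing value counts as ">= 4".
gen naive = (q49 >= 4) + (q51 >= 4)

* Guarded: "x < ." is true only for nonmissing x.
gen guarded = (q49 >= 4 & q49 < .) + (q51 >= 4 & q51 < .)

list, noobs clean
```

In the second observation both items are missing, yet the unguarded count is 2 while the guarded count is 0.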
      Last edited by Nick Cox; 29 May 2018, 13:14.



      • #4
        Thanks man, I was trying to do something similar, but in the last line I didn't think of using >= 2 on the sum.

        Much appreciated.



        • #5
          Hi Nick, I found the solution using locals really interesting as I have to generate a variable with a complex set of criteria. The use case for attempting to combine locals is just to create a binary variable to flag whether someone received a final vaccine in a series (that has multiple stipulations). Is using gen and a series of defined locals here even an appropriate solution?

          I tried to assign locals and then combine them using gen as you've done above, but the code fails with an invalid syntax error, r(198). I'll include just three of the locals, but I would like to add several more:
          Code:
          local A (threedoses == 1) + (vax3 == 4) + (vax1 == 4 | vax2 == 4) + (thirddose >= td(20sep2021)) + (ageatboosted >= 18)
          
          local B (threedoses == 1) + (vax3 == 4) + (vax1 == 4 | vax2 == 4) + ///
          (ageatboosted == 16 | ageboosted == 17) + (inrange(thirddose, td(09dec2021), td(04jan2022)) & aftersecond >= 164 | (aftersecond >= 136 & thirddose > td(04jan2022)))   
          
          local C (threedoses == 1) + (vax3 == 4) + (vax1 == 4 | vax2 == 4) + (ageatboosted >= 12 & ageboosted <= 15) + (aftersecond >= 136 & thirdose >= td(05jan2022))  
          
          gen vaxseries = `A' & `B' & `C'
          Any thoughts? I can include some sample data with dataex if that would be helpful.



          • #6
            I believe this will work provided your intent is that all of criteria A, B, and C must be satisfied.

            And I don't understand why you are using + instead of | in these situations. The use of + in #3 was a way of making an "at least two of these..." condition. But you have not imposed any count criteria, so the + would just as easily be handled with |, which would make the code a bit less obscure to read.

            Finally, there is the ever-present threat of operator precedence problems. It does appear to me that you won't run into a problem here, but you can be certain of that by enforcing your intent with parentheses:
            Code:
            gen vaxseries = (`A') & (`B') & (`C')
            This will prevent Stata from and-ing the last term of A with the first term of B rather than and-ing all of A with all of B. As I say, I think with these particular definitions of A, B, and C, this isn't a problem. But it never hurts to use the more robust coding with parentheses.
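To see the kind of case where the parentheses do matter, here is a small sketch with hypothetical macros whose contents contain | at the top level (x, y, z, loose, and safe are made-up names). In Stata, & is evaluated before |, so the unparenthesised version groups the terms differently:

```stata
* Toy data (hypothetical values).
clear
input x y z
1 0 0
0 1 1
end

* Hypothetical macros; A contains | at the top level.
local A x == 1 | y == 1
local B z == 1

* Expands to: x == 1 | y == 1 & z == 1
* which Stata reads as x == 1 | (y == 1 & z == 1).
gen loose = `A' & `B'

* Expands to: (x == 1 | y == 1) & (z == 1)
gen safe = (`A') & (`B')

list, noobs clean
```

In the first observation the two versions disagree: loose is 1 while safe is 0.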



            • #7
              I agree strongly with Clyde Schechter's warning on operator precedence, so much so that in #3 all the macro contents were parenthesised explicitly.

              But all that is involved in using locals is dividing up a complicated expression for readability. It is, or should be, the same calculation.

              To that end, A B C are dopey names. The original question didn't lend itself to anything else but for a real problem something more like


              Code:
              local female  (q42 == 1)
              local older     (age >= 65 & age < .) 
              local twopoints   (((q42 == 2) + (q43 == 2) + (q44 == 2) + (q45 == 2)) >= 2) 
              
              gen atrisk = `female' & `older' & `twopoints'
              might help anyone reading the code -- including the programmer at a later time.



              • #8
                Thank you both for the input and help. Note taken, Nick, on the local names, this would actually be beneficial to the project.

                The code still fails, but it's user error (I've never used locals before).

                And I don't understand why you are using + instead of | in these situations
                Clyde Schechter After just re-reading #3 more closely, I realize that the + symbol does not mean 'and', which is what I was actually trying to say. But as you can see, I do have several sub-criteria that require an | operator.

                This is an example of what I would have tried to do with A prior to stumbling on this thread:
                Code:
                gen vaxseries = .
                replace vaxseries = 1 if threedoses == 1 & vax3 == 4 & (vax1 == 4 | vax2 ==4) & (thirddose >= td(20sep2021)) & (ageatboosted >= 18)
                As I have a lot of vaccination criteria which varies by type, dates, age, missingness, etc., I was attempting to try to create locals in case I needed to tabulate them later by group, or in other ways.

                To create the locals, can I just substitute the + for &, and then use this?
                Code:
                 gen vaxseries = 1 if (`A') & (`B') & (`C')
                Last edited by Nathan Garst; 28 Apr 2022, 11:27. Reason: forgot something



                • #9
                  Assuming your intent was to mean "and" when you wrote +, then, yes.

                  One other recommendation. It is not a good idea in Stata to create a variable that is 1 when some logical condition is true but missing value when false. It is much better to create a 1/0 variable. So I would go with
                  Code:
                  gen vaxseries = (`A') & (`B') & (`C')
                  instead.
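A short sketch of why the 1/0 form is easier to work with, using made-up data (age, doses, flag_miss, and flag_01 are hypothetical names, and the toy data have no missing values, so no missing guard is needed here):

```stata
* Toy data (hypothetical values, no missing values).
clear
input age doses
25 3
40 2
17 3
end

* 1/missing version: the "no" cases cannot be told apart from unknowns.
gen flag_miss = 1 if age >= 18 & doses == 3

* 0/1 version: a logical expression evaluates to 1 (true) or 0 (false).
gen flag_01 = (age >= 18) & (doses == 3)

summarize flag_01        // the mean is the proportion flagged
count if flag_01 == 0    // the "no" cases can be counted directly
```

With the 1/missing version, the last two commands would have nothing useful to report, because there are no zeros to average or count.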



                  • #10
                    Still getting invalid syntax error. Could it have something to do with the version I have (Stata/MP 14.2)?

                    If there's a syntax error in the local would it also produce an invalid syntax error when combining them with gen?



                    • #11
                      I started testing each sub-criterion, and one thing I have noticed is that this one fails with an r(133) error, unknown function ():
                      Code:
                       count if (inrange(thirddose, td(09dec2021), td(04jan2022)) & aftersecond >= 164 | aftersecond >= 136 & thirddose > td(04jan2022))
                      I'm wondering if this could be where the hang-up is?



                      • #12
                        Well, if there is a syntax error in one of the local macros, yes, that will lead to a syntax error in the subsequent -gen- statement.

                        But as for the specific problem you are having in #11, it's puzzling. I created a toy data set with variables thirddose and aftersecond and ran the command you show:
                        Code:
                        . clear
                        
                        . set obs 10
                        Number of observations (_N) was 0, now 10.
                        
                        . gen thirddose = td(07dec2021) + 5*_n
                        
                        . format thirddose %td
                        
                        . set seed 1234
                        
                        . gen aftersecond = runiformint(130, 170)
                        
                        .
                        . list, noobs clean
                        
                            thirddose   afters~d  
                            12dec2021        133  
                            17dec2021        141  
                            22dec2021        135  
                            27dec2021        137  
                            01jan2022        160  
                            06jan2022        138  
                            11jan2022        132  
                            16jan2022        134  
                            21jan2022        143  
                            26jan2022        142  
                        
                        .
                        . count if (inrange(thirddose, td(09dec2021), td(04jan2022)) & aftersecond >= 164 | aftersecond >= 136 & thirddose > td(04jan2022))
                          3
                        and, as you can see, it produced no error message.

                        My best guess is that you are running this from a do-file and that do-file has somehow been contaminated with non-printing characters. In particular, I suspect that the "unknown function ()" that we are seeing is seen by Stata as X(), where X is some non-printing character (or perhaps more than one of them). This can happen when you paste things from word-processing documents, or from web pages (including Statalist! :-( ) into your do-file. These non-printing characters are used by these other applications to control formatting. They do not directly appear on the screen, but when included in a command, they are visible, and incomprehensible, to Stata. When Stata writes them to the screen, nothing shows up. So they are invisible to human eyes.

                        Sometimes if you take the same command that is giving you problems and launder it through a plain bare-bones text editor, the problem will go away. If that doesn't work, deleting the entire line(s) containing the command and retyping them in from the keyboard usually solves the problem.
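One way to check a do-file for such characters before retyping it might be Stata's hexdump command with the analyze option, which summarizes the byte values found in a file. The sketch below is only an illustration under assumptions: it writes a small temporary file containing a non-breaking space (byte 160) and then inspects it. In practice you would point hexdump at your own do-file instead.

```stata
* Build a small temporary do-file (hypothetical content) that contains a
* non-breaking space, byte 160, after the command text.
tempfile dofile
tempname fh
file open `fh' using "`dofile'", write text replace
file write `fh' "count if x > 1" _char(160) _n
file close `fh'

* Summarize the byte values in the file. Bytes outside plain printable
* ASCII and ordinary line endings deserve a closer look.
hexdump "`dofile'", analyze
```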



                        • #13
                          Clyde Schechter This was the answer!
                          My best guess is that you are running this from a do-file and that do-file has somehow been contaminated with non-printing characters
                          I just retyped the whole thing above it and it worked. Thank you.



                          • #14
                            Hello, I am new to this forum and to using Stata. My question is along the lines of this post, so I am asking here.
                            I want to code my exposure variable 'ADDx' (eczema diagnosis) as follows:
                            1. ADDx==1 if the participant reports at least 2 positive responses.
                              I did that using a syntax mentioned in this thread, thanks for that:
                              Code:
                              gen ADDx = ((eczema_1==1) + (eczema_2==1) + (eczema_3==1) + (eczema_4==1) + (eczema_5==1) + (eczema_6==1) + (eczema_7==1) + (eczema_8==1) + (eczema_9==1) + (eczema_10==1)) >= 2
                              Code:
                              tab ADDx

                                     ADDx |      Freq.     Percent        Cum.
                              ------------+-----------------------------------
                                        0 |      9,706       69.55       69.55
                                        1 |      4,250       30.45      100.00
                              ------------+-----------------------------------
                                    Total |     13,956      100.00
                            2. Now I want to exclude from ADDx==0, those with only 1 reported eczema at any time point, to be specified as an indeterminate eczema diagnosis.
                              In other words, I want my comparator group to include only those with no reported eczema at all 10 timepoints. I did so using:
                              Code:
                              gen AD=.
                              	replace AD=0 if ((eczema_1==0) & (eczema_2==0) & (eczema_3==0) & (eczema_4==0) & (eczema_5==0) & (eczema_6==0) & (eczema_7==0) & (eczema_8==0) & (eczema_9==0) & (eczema_10==0))
                              	replace AD=1 if ADDx==1
                              Is this correct?
                            3. Similarly, I have a variable for the maternal highest educational level and a variable for the partner's highest edu level. I want to generate a new variable whereby at least 1 parent has a degree.
                              Here are the 2 variables tabulated:
                               Partners highest ed qualification |      Freq.     Percent        Cum.
                              -----------------------------------+-----------------------------------
                                                     -1. Missing |      1,773       12.70       12.70
                                                       1. level1 |      1,894       13.57       26.28
                                                      2. level 2 |      1,008        7.22       33.50
                                                      3. level 3 |      2,539       18.19       51.69
                                                      4. level 4 |      3,097       22.19       73.88
                                                       5. Degree |      2,165       15.51       89.40
                                                               . |      1,480       10.60      100.00
                              -----------------------------------+-----------------------------------
                                                           Total |     13,956      100.00

                                   Mums highest ed qualification |      Freq.     Percent        Cum.
                              -----------------------------------+-----------------------------------
                                                     -1. Missing |        844        6.05        6.05
                                                       1. level1 |      1,737       12.45       18.49
                                                       2. level2 |      1,224        8.77       27.26
                                                       3. level3 |      4,290       30.74       58.00
                                                       4. level4 |      2,783       19.94       77.94
                                                       5. Degree |      1,598       11.45       89.40
                                                               . |      1,480       10.60      100.00
                              -----------------------------------+-----------------------------------
                                                           Total |     13,956      100.00

                              Which of the below 2 methods, if any, is correct? (each leads to different numbers under missing and zero)
                              Method 1:
                              Code:
                              gen degree_parent = ((mum==5) + (par==5)) >= 1
                              	replace degree_parent=. if mum==. & par==.
                              	replace degree_parent=0 if ((par==1) + (par==2) + (par==3) + (par==4) + (mum==1) + (mum==2) + (mum==3) + (mum==4))
                              Method 2:
                              Code:
                              gen parent_edu=.
                              	replace parent_edu=1 if degree_parent==1
                              	replace parent_edu=0 if ((par==1) + (par==2) + (par==3) + (par==4) + (mum==1) + (mum==2) + (mum==3) + (mum==4))
                              	replace parent_edu=. if par==. & mum==.
                              Would appreciate your input.
                            Last edited by Rita Iskandar; 17 Oct 2022, 07:30.
                            Rita
                            Stata SE 17.0



                            • #15
                              Regarding your first question, what you are doing is possibly OK, but you are doing it in a messy and difficult way.

                              One easy and transparent way to do what you want would be to first create a score that adds up the number of 1 values each person has across all of your eczema variables. You didn't tell us how your eczema variables are coded. If they are all coded 0/1 and missing, you could do this:
                              Code:
                              egen eczema_sum = rowtotal(eczema*)
                              If you have some other coding scheme, but with 1 indicating the presence of eczema, you could do this:
                              Code:
                              egen eczema_sum = anycount(eczema*), values(1)
                              Then you could use -replace- to create your variable as desired. You did not say what value you want to give to persons with a sum of 1, so you will have to modify the following to choose the value you want to represent "indeterminate." I'm thinking you might want a missing value, for which I'm choosing the default Stata value. (See -help missing-. Missing values are represented by values like ., .a, .b, etc. in Stata.)

                              Code:
                              gen ADDx = (eczema_sum >= 2)  //  0 or 1; "at least 2 positive responses"
                              replace ADDx = . if (eczema_sum == 1) // indeterminate
                              The -egen- set of commands are important to know. My advice to new users is that if you are trying to create a variable but are having trouble, you should look at those commands using -help egen-.
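Putting those steps together on made-up data, a minimal sketch might look like the following. It assumes that 1 codes a positive response, uses only three hypothetical eczema variables, and picks the extended missing value .a for "indeterminate"; the variable name AD is also just an illustration.

```stata
* Toy data (hypothetical values): three time points instead of ten.
clear
input eczema_1 eczema_2 eczema_3
0 0 0
1 0 0
1 1 0
. 1 1
end

* Count how many of the variables equal 1 in each row; missing values
* are simply not counted.
egen eczema_sum = anycount(eczema_*), values(1)

* 1 = at least two positive responses, 0 = none, .a = exactly one.
gen AD = (eczema_sum >= 2)
replace AD = .a if eczema_sum == 1

tabulate AD, missing
```

Note that a row with no positive responses but some missing time points would still be coded 0 here. Whether such rows belong in the comparator group is a decision for the analyst.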

                              Regarding your second question: I find confusing what you describe, as I'm not sure what your variables are and how they are coded. Understanding what you want and giving a helpful answer would likely be easy had you followed the advice to post example data as described in the StataList FAQ that new list members like you are asked to read. (Tab at the top left of your StataList screen.) Read there about the -dataex- command and post some example data for the variables relevant to your second question here.

                              Finally, reading -help missing- should help you learn about how missing values are handled in Stata. You will find that using numbers like -1 is not useful.
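As a sketch of what that could look like, the example below recodes -1 to the extended missing value .a with -mvdecode- and then builds an "at least one parent has a degree" indicator. It rests on assumptions that would need checking against the real data: that the variables are named mum and par, that 5 codes "Degree", and that an observation should be 0 only when both parents' levels are known and neither is a degree.

```stata
* Toy data (hypothetical values); -1 is the "Missing" code from the tables.
clear
input mum par
5 -1
3 4
-1 -1
. .
3 -1
end

* Turn the numeric code -1 into the extended missing value .a.
mvdecode mum par, mv(-1 = .a)

* 1 if either parent is known to have a degree.
gen degree_parent = (mum == 5) | (par == 5)

* Otherwise, 0 only when both levels are known; missing when either is unknown.
replace degree_parent = . if degree_parent == 0 & (missing(mum) | missing(par))

list, noobs clean
```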
                              Last edited by Mike Lacy; 17 Oct 2022, 08:20. Reason: Fixed some typos.

